This lesson is written in R, in the so-called R Markdown format. It is assumed that you have R and RStudio installed, so that you can follow every step by running the code in the grey boxes below. For further information on getting R and RStudio, see the Prerequisites section of the book R for Data Science.
This lesson is the first concrete example of how to interact with a specific API, and we pick up exactly where we left off in the previous lesson, What is an API?. The last thing we did in that lesson was to ask the Royal Danish Library’s Newspaper API how many articles mention “internet”. The answer was returned in JSON format, which we will save for later, since the Newspaper API can also return answers in CSV format, which is what we use in this example. CSV is short for Comma Separated Values and is a way of storing data in a raw text format. CSV files are easily handled by most programming languages, and especially by R. The main focus of this lesson will therefore be on constructing a request URL to the Newspaper API, as explained in the previous chapter.
As a general rule of thumb, it is always best to examine and understand the data that you are trying to extract, the service that stores them, and how they are made available before you dive into the API. This process depends entirely on the specific case, and in our case with the Newspaper API it involves looking into what the collection contains. In the following section we give a very short survey of the Danish Newspaper Collection’s history in order to better understand the data.
The collection exists because legal deposit of published material has been required by law in Denmark since 1697. As a result, Danish newspapers have been collected and stored for the future. This led to a lot of physical paper, so the library began to photograph the individual pages of each newspaper and store them on microfilm instead. Then, from 2014 to 2017, these microfilms were digitized. This involved a computer running a segmentation algorithm, which went through all the now digital pages and identified which headers belonged to which paragraphs, thus forming articles. Along with this, the computer also recognized the text, making it searchable. The process of recognizing the text is called Optical Character Recognition (OCR). These processes were not precise, especially on the older newspapers, which causes a lot of “misreadings” in the OCR text and in the segmentation of articles. The result is an ALTO file, which is short for Analyzed Layout and Text Object. This is a highly structured data format which stores information on where the individual OCR-recognised words are placed on the page as well as which article they belong to. The best way to imagine an ALTO file is as a file that contains the digital layout as recognised by segmentation and OCR. The combination of the ALTO file and the digital photograph of the newspaper pages forms a pdf file that consists of two “layers”: one is the actual picture of the newspaper pages, and the other contains the OCR text, making the pdf file searchable.
Visualization of the digitization process of the newspapers - in the segmentation and OCR, the colors indicate which text parts have been identified as belonging to each other
The result is of course a lot of pdf files, but there is also a lot of metadata around these pdf files. For example, we have the time of publication, the place of publication, and which newspaper it is. All this data is presented and made available through a graphical user interface that ordinary users can interact with. In the case of the newspaper collection this platform is called Mediestream.
Let’s use the graphical user interface on a specific case. In this case we want to find articles from the correspondent sent out by the newspaper “Dagbladet”. These articles should be on internal affairs in France and in Paris, and about the politician Charles de Rémusat, in the year 1873. The screenshot below shows how this search is performed in Mediestream. Red circles mark the elements in the interface that are of particular interest:
Example search: free text search, specification of newspaper, and definition of the time range in the selector tool in the graphical user interface
The top circle is the free text search field. This is where we define that the words “korrespondent”, “paris” and “rémusat” must be present in the OCR text of the articles that we are looking for. The next circle is where we define the time period of interest, in this case by pointing and clicking through months and years, eventually selecting 1 January 1873 to 31 December 1873 - in other words the entire year of 1873. The last circle is where we have defined that we are only interested in hits in the newspaper “Dagbladet”. This results in 9 hits, which means that 9 articles (identified as such in the segmentation process) meet our requirements.
This exact search could have been performed entirely from the free text search field using more advanced search codes. Behold this search:
This gives exactly the same result: 9 hits from the newspaper “Dagbladet”. So what has been done differently? Notice the free text search field - here we have appended “py:1873” to our previous search. This is an “advanced” search code setting the publication year (py) to 1873. Notice how the time selector is blank - this is because it hasn’t been used. Furthermore, the search code “familyId:dagbladetkoebenhavn1851” has been added, which says that we are only interested in results from the newspaper “Dagbladet”. Since “Dagbladet” is a fairly common name for a newspaper (imagine something like “Daily News”), we are using a unique id for this particular newspaper. All the newspapers in Mediestream have been given unique ids to avoid ambiguity. Thus we end up with a search string that looks like this:
korrespondent AND paris AND rémusat AND py:1873 AND familyId:dagbladetkoebenhavn1851
In order to extract raw data from the newspaper API we need to be able to define the data that we are interested in with this kind of advanced search string. It is a good idea to test the search strings in Mediestream, and once you are happy with the number of hits you take your advanced search string to the API. For more help on constructing search strings, see the page for search advice in Mediestream, where you’ll also find a link to a list of the aforementioned unique ids for the newspapers.
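As a small sketch of how such a search string is put together, the example below assembles the string from above in base R by joining its parts with the AND operator. The variable names `terms` and `search_string` are only illustrative and are not part of Mediestream or the API.

```r
# Individual search words and search codes, taken from the example above
terms <- c(
  "korrespondent",
  "paris",
  "rémusat",
  "py:1873",                          # publication year
  "familyId:dagbladetkoebenhavn1851"  # unique id of the newspaper
)

# Join the parts with the AND operator to form one search string
search_string <- paste(terms, collapse = " AND ")
search_string
```

Keeping the parts in a vector makes it easy to add or remove a single term and test the resulting string in Mediestream again.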
One important thing to add before moving on is that access to the newspaper collection is limited by copyright. This is because the newspapers are held by the library as a result of the legal deposit of published material. Some of the material is therefore still under copyright, which means that you can only view newspapers older than 100 years, and in order to extract data from the newspaper API the material must be older than 140 years.
Before moving on to extracting data from the newspaper API with a search string, let’s create a string that has more than 9 hits by expanding the time range and removing rémusat, in order to get articles containing paris and korrespondent in the period 1870 to 1875:
korrespondent AND paris AND py:[1870 TO 1875] AND familyId:dagbladetkoebenhavn1851
This search gives us 644 hits. Now we have a somewhat larger body of material, and we want to apply some kind of digital method to it. This can’t be done in the graphical user interface of Mediestream. We need to turn our focus to the API connected to Mediestream.
In order to extract the 644 articles as raw data in a machine-readable format we use the Swagger interface for the newspaper API. A Swagger interface is an interactive documentation of an API. This means that you can both try out the API’s functionality and get information about which metadata and data are exported. Furthermore, the interface shows how you can limit your search. The existence of a Swagger interface (or similar) is a good sign for data extraction, because it means that the creators have thought about communicating the API’s functionality.
Navigating to the Newspaper API Swagger UI will lead you to the following landing page:
What we see in the blue boxes are the different services that the API offers, each with a text summarizing what the service does. These are called the endpoints of the API. In this case we will focus on the first service, described in the top blue box, the endpoint:
/aviser/export/fields - Export data from old newspapers at http://mediestream.dk/
Clicking on this box expands the view and clicking “Try it out” makes the documentation interactive, which is what we want:
Expanding the /aviser/export/fields endpoint
The next step is to paste the search string from before into the query field, where it replaces the placeholder search:
Beneath the query field there is a list of all the fields that the API can return. It is a good idea to read through this list, as it will give an idea of which kinds of analytical questions can be examined through the data. For example, there is a field called “fulltext_org”, which contains: The original OCR text for the article. This tells us that we will be able to perform text mining on our data. Another field among many is the timestamp, which is the publication date of the articles, so adding a temporal perspective to any analysis will also be possible. Moving further down the Swagger page leads to the next fields of particular interest:
The first red circle is the “max” value. This is where you define how many articles you want the API to return. Remember that our current search returns 644 articles and the default value is 10. By changing this value to “-1”, as shown above, we get all the articles that the search query matches.
The next step is to change the default format value from JSON to CSV, as
shown above! The last step is to press the blue “Activate” button.
The result is the following:
Here we are given a request URL (marked by the red circle). This URL returns a CSV file that contains our 644 articles. Now that we have this request URL we are ready to import the articles into R. But before we do this, we will take a closer look at the request URL as given by Swagger above.
In order to better understand this URL we break it into individual pieces:
| Explanation | URL segment |
|---|---|
| Base URL | http://labs.statsbiblioteket.dk |
| API - endpoint | /labsapi/api/aviser/export/fields |
| Query: | ?query=korrespondent%20AND%20paris%20AND%20py%3A%5B1870%20TO%201875%5D%20AND%20familyId%3Adagbladetkoebenhavn1851 |
| Which fields to export: | &fields=link&fields=recordID&fields=timestamp&fields=pwa&fields=cer&fields=fulltext_org&fields=pageUUID&fields=editionUUID&fields=titleUUID&fields=editionId&fields=familyId&fields=newspaper_page&fields=newspaper_edition&fields=lplace&fields=location_name&fields=location_coordinates |
| Max number of rows to return | &max=-1 |
| Structure | &structure=header&structure=content |
| File format to return: | &format=CSV |
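The pieces in the table can also be put back together in R. The following is a minimal sketch in base R, assuming that percent-encoding the search string with `URLencode()` and concatenating the segments in the order shown in the table reproduces the request URL; the variable names are illustrative only.

```r
# Search string as tested in Mediestream
query <- "korrespondent AND paris AND py:[1870 TO 1875] AND familyId:dagbladetkoebenhavn1851"

# Fields to export, in the same order as in the table above
fields <- c("link", "recordID", "timestamp", "pwa", "cer", "fulltext_org",
            "pageUUID", "editionUUID", "titleUUID", "editionId", "familyId",
            "newspaper_page", "newspaper_edition", "lplace",
            "location_name", "location_coordinates")

request_url <- paste0(
  "http://labs.statsbiblioteket.dk",                   # base URL
  "/labsapi/api/aviser/export/fields",                 # API endpoint
  "?query=", utils::URLencode(query, reserved = TRUE), # percent-encoded query
  paste0("&fields=", fields, collapse = ""),           # one &fields= per field
  "&max=-1",                                           # return all matching rows
  "&structure=header&structure=content",
  "&format=CSV"
)
request_url
```

Setting `reserved = TRUE` makes `URLencode()` also encode characters such as `:`, `[` and `]`, which is why the spaces become `%20` and the colon becomes `%3A` in the query segment.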
In R, one works with packages, each adding numerous functionalities to the core functions of R. In this example, the relevant packages are tidyverse and lubridate (ggplot2, used for plotting, is part of tidyverse).
Documentation for each package:
https://www.tidyverse.org/packages/
https://lubridate.tidyverse.org/
https://ggplot2.tidyverse.org/
library(tidyverse)
library(lubridate)
By copying the request URL from the Newspaper API Swagger interface we can paste it into the following function:
dagsbladet_paris <- read_csv("http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields?query=korrespondent%20AND%20paris%20AND%20py%3A%5B1870%20TO%201875%5D%20AND%20familyId%3Adagbladetkoebenhavn1851&fields=link&fields=recordID&fields=timestamp&fields=pwa&fields=cer&fields=fulltext_org&fields=pageUUID&fields=editionUUID&fields=titleUUID&fields=editionId&fields=familyId&fields=newspaper_page&fields=newspaper_edition&fields=lplace&fields=location_name&fields=location_coordinates&max=-1&structure=header&structure=content&format=CSV")
## Rows: 644 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): link, recordID, fulltext_org, pageUUID, editionUUID, titleUUID, e...
## dbl (4): pwa, cer, newspaper_page, newspaper_edition
## dttm (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To briefly recap: CSV is short for Comma Separated Values, a way of structuring a dataset in plain text. CSV files are structured in columns separated by commas and rows separated by line breaks. Each row in the data corresponds to an article as identified by the segmentation process during the digitisation of the newspapers.
In the output from the read_csv function, R tells us which columns are present in the dataset and what type of data it has recognised in each column. Most of them are “chr” (short for “col_character()”), which means the rows in the column contain textual data (characters). Others are “dbl” (“col_double()”), which means the rows in the column contain numbers, and the timestamp column is “dttm”, a date-time. This is a question of data types, which can be very important when coding, but is outside the scope of this lesson.
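For readers who want to see how these column types can be controlled, here is a minimal sketch using readr's `col_types` argument. The CSV text is a tiny, made-up stand-in with illustrative values, not data from the newspaper export, and only three of the column names from the export are used.

```r
library(readr)

# A tiny, made-up CSV (illustrative values only, not newspaper data)
toy_csv <- I("recordID,newspaper_page,timestamp
example-1,2,1870-01-01T00:00:00Z
example-2,4,1871-06-15T00:00:00Z
")

# State the column types explicitly instead of letting read_csv guess them
toy <- read_csv(
  toy_csv,
  col_types = cols(
    recordID       = col_character(),  # text
    newspaper_page = col_double(),     # numbers
    timestamp      = col_datetime()    # date-time
  )
)

# Show the first class of each column
sapply(toy, function(column) class(column)[1])
```

Specifying the types this way also silences the column specification message that read_csv prints when it has to guess.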
Remember that we had metadata on when these articles were published? Let’s do a relatively simple examination of the distribution over time of these articles containing “korrespondent” and “paris” in the period 1870 to 1875 from the newspaper Dagbladet.